Average word length | # of sentences | Source |
---|---|---|
9.29 | 28 | 中国新石器时代契刻符号 |
41.55 | 29 | 衢州话 |
43.10 | 81 | 新石器时代的迹象中国 |
45.12 | 13 | 东南亚国家联盟 |
45.35 | 51 | 泰国省府 |
46.48 | 21 | Harry Nicolaides |
46.84 | 16 | 克里斯蒂娜 艾伐脱 |
49.21 | 11 | ISO/IEC 646 |
50.40 | 10 | 将进酒 |
50.42 | 37 | 科科斯群岛 |
51.42 | 30 | ISO/IEC 8859 |
54.45 | 18 | 玛丽咏 高帝娅 |
54.87 | 51 | 八思巴字 |
56.18 | 57 | 北榄坡府 |
56.29 | 12 | 潘迪华 |
56.95 | 24 | 呵叻府 |
58.63 | 16 | Korumburra (Victoria) |
59.80 | 10 | 射雕英雄传 |
60.00 | 31 | 帕克斯 (新南威尔士州) |
60.13 | 67 | 諺文 |
61.35 | 21 | 泰国旅游 |
61.82 | 22 | XUL |
61.98 | 37 | 碧差汶府 |
63.95 | 11 | 安娜 孛莱 |
64.64 | 47 | 罗皇节 |
65.26 | 84 | 科瑞斯特小丑 |
65.66 | 37 | 云 |
66.20 | 15 | 水晶棺材 (传说) |
66.69 | 30 | 依夫城堡 |
67.62 | 31 | 蒙古国 |

Average word length | # of sentences | Source |
---|---|---|
235.13 | 15 | 锦溪镇 |
185.46 | 13 | 汽车租赁 |
180.17 | 12 | 锡剧 |
177.23 | 40 | 张家港 |
171.67 | 57 | 新儒家 |
163.10 | 15 | 广义计算 |
160.00 | 24 | 慈溪市 |
153.88 | 12 | 羊绒 |
151.40 | 25 | 华士 |
150.24 | 17 | 吴语文学 |
147.72 | 15 | 西德 |
147.36 | 101 | 海门 |
139.31 | 13 | 个 |
139.19 | 16 | 海门山歌 |
137.27 | 45 | 南浔镇 |
136.64 | 11 | 江淮官话 |
133.85 | 41 | 朱元璋 |
130.26 | 47 | 台湾桃园国际机场 |
129.67 | 18 | 新昌 |
129.51 | 87 | 天主教上海教区 |
129.25 | 118 | 武汉大学 |
129.00 | 21 | 力学 |
125.92 | 13 | 本字 |
124.68 | 28 | 蒋经国 |
124.50 | 16 | 地理学 |
124.42 | 24 | 同里镇 |
124.24 | 17 | 模糊系统方法研究 |
124.12 | 17 | 文白异读 |
124.04 | 25 | 吴语分区 |
123.54 | 144 | 靖国神社 |
The problem addressed in this subsection, and the results, are similar to those of 6.4.1.1, but the focus is now on average word length rather than average sentence length.
Measured average word length depends strongly on tokenization. A typical tokenizer might split the string “28.06.2005” into the five tokens “28 . 06 . 2005”, with an average length of two characters. To avoid this, the number of words in a sentence is counted as 1 + (number of blanks in the sentence), and the word length of a sentence is its length divided by this count.
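The difference between the two ways of counting can be illustrated with a minimal Python sketch. The function names `count_words` and `average_word_length` are hypothetical and introduced only for this illustration. The sketch assumes that the sentence length includes the blanks themselves and is measured in characters.

```python
def count_words(sentence: str) -> int:
    """Number of words, counted as 1 + (number of blanks in the sentence)."""
    return 1 + sentence.count(" ")


def average_word_length(sentence: str) -> float:
    """Sentence length (blanks included) divided by the blank-based word count."""
    return len(sentence) / count_words(sentence)


date = "28.06.2005"

# A tokenizer that separates punctuation yields five tokens.
naive_tokens = "28 . 06 . 2005".split(" ")
naive_average = sum(len(token) for token in naive_tokens) / len(naive_tokens)

# Counting blanks instead treats the whole date as a single word.
blank_based_average = average_word_length(date)

print(naive_average)        # 2.0
print(blank_based_average)  # 10.0
```

Under the blank-based count the date contributes one word of length 10 rather than five tokens of average length 2.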
select round(avg(length(sentence) / (1 + length(sentence) - length(replace(sentence, " ", "")))), 2) as le,
       count(sentence) as cnt,
       source
from sentences s, inv_so i, sources so
where s.s_id = i.s_id and i.so_id = so.so_id
group by source
having cnt >= 10
order by le
limit 30;
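To try the query without access to the corpus database, the following sketch rebuilds the three tables in an in-memory SQLite database with invented toy data. Several points are assumptions rather than part of the original setup: the column types, the sample sentences, the helper `ranked_sources`, and the parameterised minimum sentence count (10 in the query above). The original query appears to be written for MySQL, where `/` returns a fractional result. SQLite truncates integer division, so the sketch multiplies by `1.0`. SQLite's `length()` also counts characters, whereas MySQL's counts bytes, so absolute values for non-ASCII text may differ. The descending order for a "longest" list is likewise an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table sentences (s_id integer primary key, sentence text);
    create table sources (so_id integer primary key, source text);
    create table inv_so (s_id integer, so_id integer);
""")

# Invented toy data: two sources and three sentences.
conn.executemany("insert into sources values (?, ?)",
                 [(1, "source_a"), (2, "source_b")])
conn.executemany("insert into sentences values (?, ?)",
                 [(1, "ab cd"), (2, "abcdefgh"), (3, "a b c")])
conn.executemany("insert into inv_so values (?, ?)",
                 [(1, 1), (2, 1), (3, 2)])

# Same structure as the query above; "* 1.0" forces fractional division in SQLite.
QUERY = """
    select round(avg(length(sentence) * 1.0
                     / (1 + length(sentence) - length(replace(sentence, ' ', '')))), 2) as le,
           count(sentence) as cnt,
           source
    from sentences s, inv_so i, sources so
    where s.s_id = i.s_id and i.so_id = so.so_id
    group by source
    having count(sentence) >= ?
    order by le {direction}
    limit 30
"""


def ranked_sources(min_sentences, descending=False):
    """Return (average word length, sentence count, source) rows."""
    direction = "desc" if descending else "asc"
    return conn.execute(QUERY.format(direction=direction), (min_sentences,)).fetchall()


shortest = ranked_sources(1)                   # sources with the shortest words first
longest = ranked_sources(1, descending=True)   # sources with the longest words first
print(shortest)
print(longest)
```

Each sentence contributes its own ratio of length to word count, and the ratios are then averaged per source, so long sentences do not weigh more than short ones.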
6.4.2.2 Average logarithmic word rank for different sources
6.4.2.3 Sources consisting of many / few words with frequency 1
6.4.2.4 Sources with low / high average word length of rare words